FORTRAN Comparisons
Volume Number: 8
Issue Number: 7
Column Tag: Jörg's Folder
FORTRAN Comparisons:
The sequel.
By Jörg Langowski, MacTutor Regular Contributing Author
After my September column on numerical precision in the two competing
FORTRAN compilers from Absoft and Language Systems, we received an irate comment
from an Absoft user that this comparison was not fair and I hadn’t compared both
products to their maximum advantage. An excerpt (the less irate part of it) follows; the
author wished that his name not be divulged:
Mr. Langowski runs his benchmarks using opt=3 for Language Systems but
only basic optimization for Absoft. This is patently unfair. Absoft has loop unrolling and
subroutine inclusion opts that he ignores. These greatly speed up the benchmarks. He
also penalizes Absoft by including the -e option.
I have run similar comparisons with Absoft armed in double precision mode, so
all arithmetic is comparable in accuracy. For the Whetstones, on an FX I get over 6000
with Absoft, which is better than what he gets with a Quadra 700. With a Quadra 950, I
get over 13,800 MWhets from Absoft, while I get roughly 5600 from Language
Systems.
I should point out that the Absoft manual discusses the problem of extended
arithmetic and comparisons, and notifies you of the calls to arm the FPU. I agree that
Language Systems has better support for things like Apple Events etc., but let's make
sure comparisons are fair. I should also mention that Absoft routinely compiles my
code, at all comparable levels of optimization, faster than does Language Systems.
In fact, to recommend only one of the compilers was admittedly a little too
strong. I may start with the conclusion of this column, that really both Absoft and LS
Fortran compilers are very good products and have their advantages and disadvantages.
We contacted Absoft, who had been suspiciously silent throughout all this, and asked
them to express their views. We received a very constructive letter that gives a lot of
insight into the tradeoffs that compiler makers have to deal with. Here it is:
“Dear Mr. Langowski,
We would like to take this opportunity to comment on your September column in
MacTutor magazine. It was stated in the article that “Absoft has recently announced
MacFortran II version 3.1.” We actually began shipping version 3.1.2 of MacFortran II
in October of 1991. In addition to the enhancements to the user interface, this version
includes a code generator which takes advantage of the new 68040 floating point
instructions and a math library for FORTRAN intrinsic functions which is based on the
Motorola transcendental function library intended to be used with 68040 based
machines. [in fact, I had used 3.1.2 in the tests; 3.1 was a typo. sorry. -JL].
A significant portion of the article discusses the Paranoia program which, when
compiled “with Absoft MacFortran 3.1.2, with and without optimizations, produces a
lot of error messages typical of floating point implementations where roundoff is not
handled correctly.” The method developed in the article to achieve a diagnostic-free
result by turning off optimization and using the -e option to prevent the compiler from
maintaining variables in registers is not the solution we would have chosen or
recommended. The Paranoia program can also be successfully negotiated by simply
setting the rounding precision of the floating point unit to the precision of the
benchmark. This procedure is described on pages 5-13 through 5-15 of the Porting
Code chapter of our manual. It has the advantage of not precluding optimization and allows
the compiler to maintain values in registers while still performing rounding to the
width of the variable. However, this still leaves the question of whether it is valid for a
compiler to maintain values in registers as long as possible. We feel that it is, although
we do recognize that there are circumstances where control over the side effects of this
optimization must be made available to the programmer. In particular, we provide
several options and mechanisms to assist in the development of numerically sensitive
programs on machines where the register file is wider than main storage or where fast,
but not necessarily IEEE conforming instructions are present. The MC68040 provides
single and double precision rounded basic operations whose use can improve
performance at the cost of extended precision intermediates. In addition, the VOLATILE
statement (a VAX extension) allows control over individual variables.
To further illustrate the situation, consider the following program:
C 1
      a = 1.0
      b = 3.0
      c = a/b
      if (c .eq. a/b) print *,'equal'
      end
With or without optimization, the MacFortran II compiler generates a program
that correctly prints the string “equal”. On the other hand, under the same conditions,
the Language Systems FORTRAN compiler produces a program that is silent. Does this
indicate problems with the arithmetic generated by the Language Systems compiler? An
inspection of the generated code clearly shows no errors. The Language Systems
compiler exhibits seemingly anomalous behavior precisely because it does not maintain
the variable “c” in a register; its precision is truncated to 32 bits when it is stored
in memory, but the comparison with the reloaded variable is made against the full 96
bit result of the division. The Language Systems “-ansi” switch will generate code
which will compare successfully, but I am certain they would not recommend
indiscriminate use of the option. What this example (and Paranoia) does point out are
the problems that a programmer might encounter on a machine where the width of the
register file is greater than the main storage [my underline - JL]. The Microsoft
compiler for Intel based computers will also fail on this example if optimization is
turned off (Intel floating point units are 80 bits wide).
The section of the article which describes the results of the speed tests begins
with the cautionary remark “we should therefore use the Absoft compiler at least with
the -e option, and maybe also drop the optimizations.” We urge you to reconsider your
conclusions as there is a large body of problems that is not sensitive to environments
where greater precision is maintained in registers than in memory. To relegate these
programs to slower than optimal performance achieves no useful end. Numerically
sensitive programs that explore the boundaries of precision are often better served by
setting the floating point unit to the rounding state they expect.
We noticed that several recommended options, in particular subroutine folding and loop
unrolling, were not used when running the benchmarks. Although we have not
used the Whetstone benchmark for comparison with our competitors on the Macintosh
for over a year, it can dramatically demonstrate some of the performance benefits of
certain optimization techniques. The real advantage of subroutine inlining or folding is
typically not the elimination of the call-return sequence, but rather the opportunities
for further optimizations that it exposes to the compiler. This is in fact what happens
when the P3 subroutine is folded in the Whetstone benchmark. The compiler is able to
determine that the loop is completely invariant and can set the result values without
performing a single loop iteration. As small encapsulated functions dictated by modern
programming paradigms become more commonplace, this optimization technique will
yield even greater performance improvements.
When the -O option (basic optimizations) is used, innermost loops which consist
of a single executable statement are automatically unrolled as is indicated on page 4-16
of our manual. This is the case in the saxpy subroutine in the Linpack benchmark. As
instruction and data cache sizes become larger, multiple execution units (super-scalar
processors) are introduced, and register files are expanded, loop unrolling becomes a
very powerful optimization. It allows a compiler to maintain more values in registers,
schedule code for the various execution units, and group data loads and stores in an
attempt to minimize memory traffic.
When comparing the capabilities of two different compilers on the same piece of
hardware, we feel that they each should be shown off to their greatest advantage.
Sincerely,
Peter A. Jacobson”
Thank you very much for taking the time to reply. It was an oversight that I had
not looked into changing the precision of the FPU for running Paranoia. When you set
the FPU to single precision, the generated code does in fact pass all the numerical
accuracy tests (for an example of how to do this, see the listing).
How to handle the situation where the register precision of the FPU differs from
that of the variables in memory is probably a question of philosophy. It is true that
if you keep intermediate values in registers during subexpression evaluation, you may
get results that differ slightly from those of the same Fortran statements when the
intermediate results are stored in memory. The speed advantage of using the internal
registers to their full extent therefore comes with the need to control the FPU
precision yourself in numerically sensitive parts of your code: if such a piece of code is
written using 32-bit single precision variables, you should set the FPU to single
precision rounding, and reset it to its original state after you're done with that
particular routine. I suppose that is not too much work when you are developing,
e.g., a fast math library.
But let’s look at the optimization issue, which is where the two compilers really
differ. In order to gain a fair comparison on the general-purpose quality of the code
produced by both compilers, I chose those optimization levels on both systems that gave
the best results on the Linpack benchmark (and incidentally also on a Monte-Carlo and
a Brownian Dynamics simulation ‘real-world’ problem that I am currently dealing
with). Those parameters were for LS Fortran: opt=3 and for Absoft: basic
optimizations, no subroutine folding, loop unrolling level 2 or none at all (no big